ADAPTIVE NONPARAMETRIC CONFIDENCE SETS

By James Robins and Aad van der Vaart 
Harvard University and Vrije Universiteit Amsterdam 

We construct honest confidence regions for a Hilbert space-valued
parameter in various statistical models. The confidence sets can be
centered at arbitrary adaptive estimators, and have diameter which
adapts optimally to a given selection of models. The latter adaptation
is necessarily limited in scope. We review the notion of adaptive
confidence regions, and relate the optimal rates of the diameter of adaptive
confidence regions to the minimax rates for testing and estimation.
Applications include the finite normal mean model, the white noise
model, density estimation and regression with random design.

1. Introduction. Consider an observation $X^{(n)}$ distributed according to
a law $P_\theta^{(n)}$ depending on a parameter $\theta$ that ranges over a
subset $\Theta$ of a separable Hilbert space. Specifically, we take the
Hilbert space equal to $\mathbb{R}^n$ with the Euclidean norm, or the sequence
space $\ell_2 = \{\theta = (\theta_1, \theta_2, \ldots) : \sum_{i=1}^\infty \theta_i^2 < \infty\}$
with the squared norm $\|\theta\|^2 = \sum_{i=1}^\infty \theta_i^2$. Our aim is
to construct (asymptotic) confidence sets $C_n$ of small diameter for the
parameter $\theta$, which are "honest" in the sense that, for a given
confidence level $1 - \alpha$,

(1.1) $\liminf_{n\to\infty} \inf_{\theta\in\Theta} P_\theta(\theta \in C_n) \ge 1 - \alpha.$

This problem has been considered by, among others, Li [32] and Baraud [1] in
the case that $\Theta$ is equal to $\mathbb{R}^n$ and the observation is a
Gaussian vector with mean $\theta$ and covariance matrix the identity, by
Hoffmann and Lepski [20] in the case that $\theta \in \ell_2$ and the
observation is an infinite sequence of Gaussian variables with means
$\theta_i$ and variance $\sigma^2/n$, and by Beran [4], Beran and
Dümbgen [5] and Genovese and Wasserman [18] in the case of the fixed
design regression model. Our aim in this paper is to propose new confidence
procedures for these and related models, which shed light on some of the
questions raised in the discussion of the paper by Hoffmann and Lepski [20].
We construct confidence sets with the properties:

(i) The confidence set is honest on the model $\Theta$.

(ii) The confidence set is centered at an estimator of choice, for example,
an adaptive estimator.

(iii) The diameter of the confidence set adapts to submodels of $\Theta$ in a
rate-optimal way.

In the second and third points we improve on the results in the mentioned
papers, at least as regards rates. Our method in its simplest form as
presented below leads to an increase of the "constants."

Since completing our paper we have learned about the work of Juditsky 
and Lambert-Lacroix [25] and Cai and Low [11]. Juditsky and Lambert- 
Lacroix [25] appear to deserve priority in discussing adaptive confidence 
sets. In their beautiful paper they pose the problem within the setting of 
fixed-design regression with Gaussian errors and obtain adaptation in the 
scale of Besov spaces, using wavelet-based methods. An insightful discussion 
of the problem and basic insights about its relationship to loss estimation 
and minimax estimation and testing can already be found in this paper. 
Cai and Low [11] consider the problem of adaptive confidence regions in 
the setting of the Gaussian white noise model, and obtain adaptation in the 
scale of Besov spaces, also using wavelet-based estimators. Our method is 
more flexible and applies to more settings, but we develop the results only 
for the scale of Sobolev spaces. In certain respects it is close to the method 
of Juditsky and Lambert-Lacroix [25]. 

As is pointed out in the preceding references, the desired honesty (i) 
severely limits the possibility of adaptation as in aim (iii). In the past years 
many successes have been obtained in the construction of estimators that 
are simultaneously minimax over a large selection of models. (See, e.g., [2, 3, 
13, 14, 15, 16, 19, 29, 30, 31, 34, 37].) These estimators are able to adapt to 
the "regularity" of the true underlying parameter, without pre-knowledge of 
the parameter or its regularity. However, as pointed out by Birgé [7], these
estimators have the property of being close to the true parameter without 
the statistician being able to tell how close it is. An adaptive estimator can 
adapt to an underlying model, but does not reveal which model it adapts 
to, with the consequence that nonparametric confidence sets are necessarily 
much wider than the actual discrepancy between an adaptive estimator and 
the true parameter. 

If one drops "honesty" (i) from the requirements of the confidence set,
but requires, for instance, only that the confidence set is honest over every
submodel $\Theta_1 \subset \Theta$ of interest [i.e., (1.1) with $\Theta$
replaced by $\Theta_1$], then this embarrassing problem disappears, and it is
possible to construct "confidence sets" of a diameter that adapts to the
estimation rate. Most procedures in the literature fall in this category.
However, dropping full honesty (i) appears to contradict the very definition
of a confidence set. In this paper we require honesty in the sense of (1.1)
with $\Theta$ the collection of all parameters deemed possible. Thus we
consider a list of models and require honesty on the "biggest model" in the
list.

Under this requirement the possibilities for adaptation are severely
limited. For a given submodel $\Theta_1 \subset \Theta$, the diameter of a
confidence region that is honest for $\Theta$ cannot be of smaller order,
uniformly over $\Theta_1$, than:

(a) The "slowest rate" $\varepsilon_n \to 0$ such that for any estimator
sequence $T_n$ and some $\beta > \alpha$,

$\liminf_{n\to\infty} \sup_{\theta\in\Theta_1} P_\theta(\|T_n - \theta\| \ge \varepsilon_n) \ge \beta.$

This is typically the minimax rate of estimation for the model $\Theta_1$.

(b) The minimax rate of testing of the hypothesis $H_0 : \theta \in \Theta_1'$
versus the alternative $H_1 : \theta \in \Theta$, $\|\theta - \Theta_1'\| \ge \varepsilon_n$,
for any given $\Theta_1' \subset \Theta_1$, for example, a one-point set
$\Theta_1' = \{\theta_1\}$. This rate is often determined by the full model
$\Theta$, rather than the submodel $\Theta_1$.

These lower bounds appear to be well known. Juditsky and Lambert-Lacroix
[25] discuss such bounds in the setting of Besov spaces. For completeness we
give precise statements in Section 6.

Our confidence sets have diameter of the order the maximum of the
rates in (a)-(b), simultaneously over many submodels, at least for regularity
classes as in the following example, and hence satisfy aim (iii).

Example 1.1 (Regular parameters). A parameter $\theta \in \ell_2$ can be
called $\beta$-regular (for a given $\beta > 0$) if it belongs, for some
$L > 0$, to the ellipsoid

$S(\beta, L) = \{\theta \in \ell_2 : \sum_{i=1}^\infty \theta_i^2 i^{2\beta} \le L^2\}.$

If the coordinates of $\theta$ correspond to classical Fourier coefficients,
then $S(\beta, L)$ corresponds to periodic functions with $\beta$ derivatives
bounded by a multiple of $L$ in $L_2[0, 1]$. (For real functions and the
sine-cosine basis the correspondence is more accurate if we replace
$i^{2\beta}$ by $(i - 1)^{2\beta}$ for odd values of $i$. See, e.g., the
Appendix of [38].)

Consider inference on $\theta \in S(\beta, L)$ based on observing each
$\theta_i$ with an independent $N(0, \sigma^2/n)$ error, or in one of the
other models discussed below, which yield similar results. The minimax
estimation rate for $S(\beta, L)$ is $n^{-\beta/(2\beta+1)}$ (cf., e.g.,
[9, 21, 22, 36]). For $\beta_1 > \beta$ and $L_1 < L$ we have
$S(\beta_1, L_1) \subset S(\beta, L)$ and the minimax testing rate of
$S(\beta_1, L_1)$ relative to $S(\beta, L)$ in the sense of (b) is
$n^{-\beta/(2\beta+1/2)} \ll n^{-\beta/(2\beta+1)}$. (See Theorem 2.1 or 3.1,
or [24].)
Table 1
Order of maximal diameter of confidence regions on the submodel
$S(\beta_1, L_1) \subset S(\beta, L)$, cut-off points and number of
observations needed to estimate $\sigma^2$

$\beta$   $\beta_1$     Radius on $S(\beta_1, L_1)$   Cut-off      Obs for $\sigma$
1         $\ge 2$       $n^{-2/5}$                    $n^{2/5}$    $\gg n^{2/5}$
1         3/2           $n^{-3/8}$                    $n^{2/5}$    $\gg n^{2/5}$
1         1             $n^{-1/3}$                    $n^{2/5}$    $\gg n^{2/5}$
1/2       $\ge 1$       $n^{-1/3}$                    $n^{2/3}$    $\gg n^{2/3}$
1/2       3/4           $n^{-3/10}$                   $n^{2/3}$    $\gg n^{2/3}$
1/2       1/2           $n^{-1/4}$                    $n^{2/3}$    $\gg n^{2/3}$
1/4       $\ge 1/2$     $n^{-1/4}$                    $n$          $\gg n$
1/4       1/4           $n^{-1/6}$                    $n$          $\gg n$
1/8       $\ge 1/4$     $n^{-1/6}$                    $n^{4/3}$    $\gg n^{4/3}$
0         $\ge 0$       1                             $n^2$        $\gg n^2$

If the supermodel $\Theta$ is equal to $S(\beta, L)$, then these bounds
suggest that a confidence set can have diameter of order
$n^{-\beta/(2\beta+1)}$ uniformly over $\Theta$, and of order
$n^{-\beta_1/(2\beta_1+1)} \vee n^{-\beta/(2\beta+1/2)}$ uniformly over
the smaller model $\Theta_1 = S(\beta_1, L_1)$ for $\beta_1 > \beta$ and
$L_1 < L$. If $\beta_1 \in (\beta, 2\beta)$, then the latter rate is equal to
$n^{-\beta_1/(2\beta_1+1)}$ and depends on the submodel. In that case we may
say that adaptation occurs.

This type of adaptation is very different from adaptation in the context
of estimation. For $\beta_1 \ge 2\beta$ the diameter is
$n^{-\beta/(2\beta+1/2)}$, independent of the exact value of $\beta_1$, so
that further regularity does not yield smaller confidence regions. Even on
very small submodels ($\beta_1 \to \infty$), the diameter of a confidence
region is at least of the order $n^{-\beta/(2\beta+1/2)}$, determined by the
supermodel. As illustration, Table 1 gives the rates for some values of the
regularity parameters. The meaning of the last two columns of the table is
explained later on in the paper.

Our method to construct confidence regions, described in Section 2, is
based on a sample-splitting procedure. We use half the data to construct
centering estimators $\hat\theta^{(n)}$, and an independent second half to
construct a confidence region around $\hat\theta^{(n)}$. The nature of the
initial estimator $\hat\theta^{(n)}$ is irrelevant for the honesty of the
confidence procedure, and hence $\hat\theta^{(n)}$ can be any of our favorite
estimators. In particular, it can be an estimator that adapts to a selection
of models of our choice. Our procedure borrows its adaptive strength from
these initial estimators, but of course only up to the limitations described
earlier.

Refinements of this procedure would be to construct two confidence sets,
with the roles of the two half-samples interchanged, and to take the
intersection, or to split the sample into more parts. For restricted
supermodels $\Theta$ the splitting may be avoided altogether. This may lead
to better constants in the centering and diameter of the confidence set. In
this paper we are interested in rates only, and for this our simple
sample-splitting scheme suffices.

In the case that the observations are a random sample, we can form the
two halves of the data by simply splitting the sample into two parts, using the
first half-sample to construct the estimator $\hat\theta^{(n)}$ and the second
to construct the confidence region. In other examples of interest a similar
situation can be created using a more involved splitting device, which we
describe below.

The organization of the paper is as follows. In Section 2 we describe the 
construction in a general framework. In Sections 3, 4 and 5 we give the 
details for the three main examples, sequence models, density estimation 
and random regression. Finally in Section 6 we relate the diameter of a 
confidence region to the testing and estimation rates. 

We close this introduction with a description of a number of examples to 
which our construction applies, together with a review of the literature. 

Example 1.2 (Finite sequence model). In this model the observation is
a vector $X^{(n)} = (X_1, X_2, \ldots, X_n)$ from an $n$-dimensional normal
distribution with mean vector $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$
and covariance matrix $(\sigma^2/n)I$. The variance $\sigma^2$ is known and
the parameter $\theta$ is known to belong to a subset $\Theta$ of
$\mathbb{R}^n$, which may be all of $\mathbb{R}^n$.

This model was studied in [32] and [1] under the assumption that
$\Theta = \mathbb{R}^n$. The naive procedure in this situation is the
chi-square region
$\{\theta \in \mathbb{R}^n : \|\theta - X^{(n)}\|^2 \le (\sigma^2/n)\chi^2_{n,1-\alpha}\}$,
with $\chi^2_{n,1-\alpha}$ the $(1-\alpha)$-quantile of the chi-square
distribution with $n$ degrees of freedom, which derives from inverting the
likelihood ratio test. It has diameter of order 1, uniformly in (and
independently of) $\theta$.

Li [32] showed that requiring honesty relative to all parameters
$\theta \in \mathbb{R}^n$ implies that no confidence region can achieve a
diameter that is uniformly smaller than $n^{-1/4}$, and exhibited confidence
regions around shrinkage estimators that may achieve the rate $n^{-1/4}$ on
the submodel where the shrinkage estimator performs well. Li's confidence
sets improve on the naive chi-square procedure at true parameters where the
shrinkage estimator improves upon the naive estimator $X^{(n)}$. Baraud [1]
constructs confidence regions that improve on the naive procedure in a wider
range of submodels. His procedure is based on comparing a range of submodels
by chi-square tests. The confidence regions in the present paper manage to
adapt to still more submodels, if the initial estimators are chosen so as to
fully profit from the recent insights in adaptive estimation, such as in [8].

It is notable that in this model the variance $\sigma^2$ is assumed known.
Baraud [1] shows that in the case that $\sigma^2$ is an unknown parameter
ranging over some interval (even a very short one), confidence regions that
are honest over $\Theta = \mathbb{R}^n$ and $\sigma^2$ can never have diameter
less than order 1.

Because the observations in this example are non-i.i.d., splitting the
sample is not a good device for separating the construction of a center and
a radius of the confidence region. However, we may artificially produce two
normal vectors $X'$ and $X''$ with means $\theta$ from a given
$N_n(\theta, (\sigma^2/n)I)$-distributed random vector $X$ using
randomization. Given a sample of independent, uniform variables $U_i$
independent of $X$, it suffices to define

$X_i' = X_i + \Phi^{-1}(U_i)\,\sigma/\sqrt{n},$
$X_i'' = X_i - \Phi^{-1}(U_i)\,\sigma/\sqrt{n}.$

Then it can be verified that $X_i'$ and $X_i''$ are independent random
variables with means $\theta_i$ and variances $2\sigma^2/n$. Thus the
observations can be duplicated at the cost of multiplying the variance
$\sigma^2$ by 2. In the remainder of the paper we shall assume that a device
of this type has been applied, and write $X^{(n)}$ for the second sample (on
which the estimate of the radius of the confidence set is based), and assume
that this is independent of the initial estimator $\hat\theta^{(n)}$ for
$\theta$.
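The randomization device can be checked numerically. The sketch below is an illustration under assumed parameter values (a common mean of 0.3, $\sigma = 1$, $n = 4$), with hypothetical function names; it uses only the Python standard library, where `statistics.NormalDist().inv_cdf` plays the role of $\Phi^{-1}$. It duplicates many independent coordinates and reports the empirical variances of the two copies and their empirical covariance, which should be close to $2\sigma^2/n$ and 0.

```python
import random
from statistics import NormalDist, mean, pvariance


def duplicate(x, sigma, n, rng):
    """Split coordinates of an N(theta, sigma^2/n I) vector into two copies.

    Returns (x1, x2) with x1[i] = x[i] + z_i * sigma / sqrt(n) and
    x2[i] = x[i] - z_i * sigma / sqrt(n), where z_i = Phi^{-1}(U_i).
    """
    std_normal = NormalDist()
    scale = sigma / n ** 0.5
    x1, x2 = [], []
    for xi in x:
        u = rng.random()
        while u == 0.0:  # inv_cdf is undefined at 0; redraw in that rare case
            u = rng.random()
        z = std_normal.inv_cdf(u)
        x1.append(xi + z * scale)
        x2.append(xi - z * scale)
    return x1, x2


def simulate(reps=20000, theta=0.3, sigma=1.0, n=4, seed=0):
    """Empirical variances of both copies and their covariance over reps draws."""
    rng = random.Random(seed)
    sd = sigma / n ** 0.5
    # reps independent coordinates, all with the same (assumed) mean theta
    x = [rng.gauss(theta, sd) for _ in range(reps)]
    x1, x2 = duplicate(x, sigma, n, rng)
    m1, m2 = mean(x1), mean(x2)
    cov = mean((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return pvariance(x1), pvariance(x2), cov


if __name__ == "__main__":
    v1, v2, cov = simulate()
    print(f"var(X')={v1:.3f}, var(X'')={v2:.3f}, cov={cov:.3f}")
```

Since the two copies are jointly normal, zero covariance is equivalent to independence, which is why the covariance is the quantity inspected here.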

Knowledge of $\sigma^2$ is crucial for this randomization step. Good
estimators would do as well, but it is impossible to estimate $\sigma^2$ in
this model without restricting the mean parameter $\theta$ to a proper subset
of $\mathbb{R}^n$. Baraud [1] shows that the size of a confidence set can
never be of smaller order than the imprecision in $\sigma$.

Example 1.3 (Infinite sequence model). In this model the observations
are an infinite sequence $X^{(n)} = (X_1, X_2, \ldots)$ of independent random
variables $X_i$ possessing normal distributions with means $EX_i = \theta_i$
and variance $\sigma^2/n$. The parameter is the mean vector
$\theta = (\theta_1, \theta_2, \ldots)$ and is known to belong to a subset
$\Theta$ of $\ell_2$.

This model is a version of the white noise model, and is considered in
connection to confidence regions in Hoffmann and Lepski [20]. (The focus of
these authors is on "random normalizing constants" rather than confidence
regions, but, as most of the discussants of their paper, we interpret their
results with respect to their implications for confidence regions.) Hoffmann
and Lepski [20] assume that there is a largest model $\Theta$ of interest, and
exhibit confidence regions that are adaptive to finitely many submodels.
Our construction allows infinitely many submodels and yields confidence
regions around arbitrary initial estimators $\hat\theta^{(n)}$, for example,
adaptive ones. Hoffmann and Lepski consider the general setting of
anisotropic regression models, but we illustrate our method for the
regularity classes of Example 1.1 only.

We can use the same device as in Example 1.2 to duplicate the
observations, at the cost of doubling the variance $\sigma^2$.

Typically one chooses $\Theta$ to be a relatively small subset of $\ell_2$.
Then it is easy to find good estimators of $\sigma^2$, and it is not necessary
to assume that $\sigma$ is a priori known. For instance, if $\Theta$ is an
ellipsoid of the form
$\{\theta \in \ell_2 : \sum_{i=1}^\infty \theta_i^2 i^{2\beta} < \infty\}$,
then we may base an estimate of $\sigma^2$ on the observations
$X_{k+1}, X_{k+2}, \ldots, X_{k+m}$ for sufficiently large integers $k, m$,
which are approximately $N(0, \sigma^2)$-distributed for large $k$. The
availability of an infinite sequence allows one to control the bias and
variance of estimators of $\sigma^2$ to arbitrary precision by choosing $k$
and $m$, respectively, sufficiently large.
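As a rough illustration of the tail-based variance estimate, the sketch below averages rescaled squares of $X_{k+1}, \ldots, X_{k+m}$. Several choices here are assumptions and not taken from the text: the coordinates are simulated with variance $\sigma^2/n$ as in the model statement, so the squares are multiplied by $n$; the true parameter is the hypothetical $\theta_i = i^{-\beta-1}$; and the function names are invented for the example.

```python
import random


def estimate_sigma2(x_tail, n):
    """Average of n * X_i^2 over the supplied tail coordinates."""
    return n * sum(v * v for v in x_tail) / len(x_tail)


def simulate_tail(k, m, n, sigma, beta, seed=0):
    """Simulate X_{k+1}, ..., X_{k+m} and return the variance estimate."""
    rng = random.Random(seed)
    sd = sigma / n ** 0.5
    # Hypothetical truth theta_i = i^(-beta-1): sum theta_i^2 i^(2 beta) is finite.
    tail = [rng.gauss(i ** (-beta - 1.0), sd) for i in range(k + 1, k + m + 1)]
    return estimate_sigma2(tail, n)


if __name__ == "__main__":
    # Larger k shrinks the bias from the means; larger m shrinks the variance.
    for k, m in [(0, 50), (1000, 50), (1000, 5000)]:
        est = simulate_tail(k=k, m=m, n=100, sigma=1.0, beta=1.0)
        print(f"k={k}, m={m}: sigma^2 estimate {est:.3f}")
```

The bias of this estimate is $(n/m)\sum_{i=k+1}^{k+m}\theta_i^2$, which is controlled through $k$, while its variance is of order $\sigma^4/m$, controlled through $m$.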

Example 1.4 (Density estimation). In this model the observation is an
i.i.d. sample $X_1, \ldots, X_n$ from a density $f$ relative to some measure
$\mu$ on a measurable space $(\mathcal{X}, \mathcal{A})$. The density $f$ is
known to belong to a subset $\mathcal{F}$ of
$L_2(\mathcal{X}, \mathcal{A}, \mu)$.

We can cast this example into a problem of estimating a sequence
$\theta = (\theta_1, \theta_2, \ldots)$ of parameters by expanding $f$ on a
fixed orthonormal basis $e_1, e_2, \ldots$ of
$L_2(\mathcal{X}, \mathcal{A}, \mu)$. This expansion takes the form of the
Fourier series $f = \sum_i \theta_i e_i$, for the Fourier coefficients
$\theta_i = \langle f, e_i \rangle = E e_i(X_1)$.

The empirical Fourier coefficients $Y_i = n^{-1} \sum_{j=1}^n e_i(X_j)$ are
unbiased estimators of the parameters $\theta_i$. However, they are only
approximately normally distributed and not independent, and it does not seem
fruitful to cast this example into the framework of the sequence model of
Example 1.3 with observational vector $(Y_1, Y_2, \ldots)$. The Le Cam
equivalence of the white noise model and the density estimation model, proved
under conditions by Nussbaum [35], offers a different connection between the
two examples, but can be used only if $\mathcal{F}$ is restricted and yields
regions of complicated form. (The latter objection is alleviated by the
recent constructions of Brown, Carter, Low and Zhang [10].) Our direct
approach gives concrete confidence sets and in wider generality.
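To make the empirical Fourier coefficients concrete, here is a small sketch under assumptions that are not part of the text: the sample space is $[0, 1]$ with Lebesgue measure, the basis is the cosine basis $e_1 = 1$, $e_i(x) = \sqrt{2}\cos((i-1)\pi x)$, and the data come from the hypothetical density $f(x) = 2x$, for which $\theta_1 = 1$ and $\theta_2 = -4\sqrt{2}/\pi^2$.

```python
import math
import random


def cosine_basis(i, x):
    """i-th element (i = 1, 2, ...) of the cosine basis of L2[0, 1]."""
    return 1.0 if i == 1 else math.sqrt(2.0) * math.cos((i - 1) * math.pi * x)


def empirical_coefficients(sample, k):
    """Y_i = n^{-1} sum_j e_i(X_j) for i = 1, ..., k."""
    n = len(sample)
    return [sum(cosine_basis(i, x) for x in sample) / n for i in range(1, k + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    # sqrt(U) with U uniform on [0, 1) has density 2x on [0, 1].
    sample = [math.sqrt(rng.random()) for _ in range(20000)]
    y = empirical_coefficients(sample, 3)
    print("empirical:", [round(v, 3) for v in y])
    print("theta_2  :", round(-4 * math.sqrt(2) / math.pi ** 2, 3))
```

The coefficients computed this way share the same sample, which is the dependence between the $Y_i$ that the text points to.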

We can split the sample into two independent halves to construct the
center $\hat\theta^{(n)}$ and the radius $R_n(\hat\theta^{(n)})$ of the
confidence set.

There is no parameter $\sigma^2$ to be dealt with in this example.

Example 1.5 (Random regression). In this model the observation is an
i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from the distribution of a
vector $(X, Y)$ described structurally as $Y = f(X) + \varepsilon$, for
$(X, \varepsilon)$ a random vector with $E(\varepsilon | X) = 0$ and
$E(\varepsilon^2 | X) < \infty$ almost surely. The regression function $f$ is
known to belong to a subset $\mathcal{F}$ of
$L_2(\mathcal{X}, \mathcal{A}, P_X)$ for $P_X$ the marginal distribution of
$X$, which is assumed known. The variance function
$\sigma^2(x) = E(\varepsilon^2 | X = x)$ need not be known, although for
confidence intervals that are honest in $\sigma^2$ we need a known upper
bound. We do not assume that the errors are normally distributed, and we do
not assume that $X$ and $\varepsilon$ are independent.

As in Example 1.4 we can cast this example into a problem of estimating a
sequence $\theta = (\theta_1, \theta_2, \ldots)$ of parameters by expanding
$f$ on a fixed orthonormal basis $e_1, e_2, \ldots$ of
$L_2(\mathcal{X}, \mathcal{A}, P_X)$. The Fourier coefficients take the form
$\theta_i = \langle f, e_i \rangle = E e_i(X) Y$.

The Fourier coefficients can be estimated unbiasedly by the estimators
$Z_i = n^{-1} \sum_{j=1}^n Y_j e_i(X_j)$, but, as in Example 1.4, it appears
not useful to try and reduce the model to the sequence model of Example 1.3
by considering $(Z_1, Z_2, \ldots)$ as the observation.
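The estimators $Z_i$ can be illustrated in the same way as the empirical coefficients of Example 1.4. The sketch below assumes, purely for illustration, a uniform design on $[0, 1]$ (so that the cosine basis is orthonormal in $L_2(P_X)$), the hypothetical regression function $f(x) = x$ and normal errors with standard deviation 0.5; none of these choices comes from the text. In this setup $\theta_1 = 1/2$ and $\theta_2 = -2\sqrt{2}/\pi^2$.

```python
import math
import random


def basis(i, x):
    """Cosine basis of L2[0, 1], orthonormal under the uniform design."""
    return 1.0 if i == 1 else math.sqrt(2.0) * math.cos((i - 1) * math.pi * x)


def regression_coefficients(xs, ys, k):
    """Z_i = n^{-1} sum_j Y_j e_i(X_j) for i = 1, ..., k."""
    n = len(xs)
    return [sum(y * basis(i, x) for x, y in zip(xs, ys)) / n
            for i in range(1, k + 1)]


def simulate_regression(n, seed=0):
    """Draw (X_j, Y_j) with X uniform on [0, 1) and Y = X + noise (assumed f)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate_regression(20000)
    z = regression_coefficients(xs, ys, 2)
    print("Z_1, Z_2 :", [round(v, 3) for v in z])
    print("theta_1,2:", 0.5, round(-2 * math.sqrt(2) / math.pi ** 2, 3))
```

The basis is orthonormal only because the design distribution is taken as known here, which mirrors the role of the assumption on $P_X$ in this example.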

The assumption that the design distribution $P_X$ is known may be
realistic in some practical situations, but is unpleasant. Perhaps it is a
little surprising that it is not a merely technical assumption, but essential
for the construction of our confidence sets. We intend to show elsewhere that
the radius of the confidence sets will increase if $P_X$ is unknown, in
varying amount, depending on what a priori assumptions are made on $P_X$. If
$P_X$ is completely unknown, then intuitively this model should be equivalent
to the fixed design regression model discussed in Example 1.6.

Example 1.6 (Fixed regression). In this model the observation is a
vector $Y = (Y_1, \ldots, Y_n)$ of independent random variables distributed
according to the regression model $Y_i = f(x_i) + \varepsilon_i$, for
$\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. normal variables with
$E\varepsilon_i = 0$ and $E\varepsilon_i^2 = \sigma^2$ and $x_1, \ldots, x_n$
known constants. The variance $\sigma^2$ is known and the function $f$ is
known to belong to a subset $\mathcal{F}$ of
$L_2(\mathcal{X}, \mathcal{A}, \mu)$ for some distribution $\mu$.

Genovese and Wasserman [18] put this model in a sequence framework
by expansion of the regression function on an empirical wavelet basis. They
justify the REACT confidence sets of Beran [4] and Beran and Dümbgen [5] in
terms of an honest confidence set over $\beta$-regular regression functions
$f$, described in terms of a wavelet expansion. This is also the model
treated by Juditsky and Lambert-Lacroix [25].

The model can be seen to reduce to a version of the finite sequence model
of Example 1.2. All information about the regression function $f$ outside the
design set $\{x_1, \ldots, x_n\}$ must stem from the model and not from the
data. This point was made previously in Li [32], who gives the regression
model as motivation for studying the finite sequence model. We shall not
further discuss this model separately.

2. Construction of confidence regions. Our method is based on sample
splitting. We suppose that initial estimators $\hat\theta^{(n)}$ are given,
and construct the confidence region based on $\hat\theta^{(n)}$ and an
additional independent observation $X^{(n)}$. It was discussed previously how
to split the data into independent "halves" that can be used for constructing
$\hat\theta^{(n)}$ and $X^{(n)}$. The nature of the initial estimator is
irrelevant for the honesty of the confidence procedure, and hence
$\hat\theta^{(n)}$ can be any of our favorite estimators. In particular, it
can be an estimator that adapts to a selection of models of our choice.

Our confidence regions are based on estimators
$R_n(\hat\theta^{(n)}) = R_n(\hat\theta^{(n)}, X^{(n)})$ of the squared norm
$\|\theta - \hat\theta^{(n)}\|^2$ such that

(2.1) $\liminf_{n\to\infty} \inf_{\theta\in\Theta} P_\theta\big(R_n(\hat\theta^{(n)}) - \|\theta - \hat\theta^{(n)}\|^2 \ge -z_\alpha \hat\tau_{n,\theta} \,\big|\, \hat\theta^{(n)}\big) \ge 1 - \alpha,$

for "scale estimators" $\hat\tau_{n,\theta}$ and "quantiles" $z_\alpha$. The
probability is computed conditionally given the estimators
$\hat\theta^{(n)}$, and hence refers only to the observation $X^{(n)}$ used to
calculate $R_n(\hat\theta^{(n)})$ and $\hat\tau_{n,\theta}$. In view of
Fatou's lemma the unconditional coverage probability will also be at least
$1 - \alpha$. Then the set

(2.2) $C_n = \big\{\theta \in \Theta : \|\theta - \hat\theta^{(n)}\| \le \sqrt{z_\alpha \hat\tau_{n,\theta} + R_n(\hat\theta^{(n)})}\big\}$

is an honest confidence region with coverage probability at least
$1 - \alpha$. (Define $\sqrt{x}$ to be 0 if $x < 0$.) The confidence region
$C_n$ is in general not a ball. However, in all our examples the scale
estimators $\hat\tau_{n,\theta}$ satisfy

$\hat\tau_{n,\theta} \lesssim \hat\tau_n + \frac{\|\theta - \hat\theta^{(n)}\|}{\sqrt{n}},$

where $\lesssim$ denotes smaller than up to a constant which is fixed by the
setting and $\hat\tau_n$ is independent of $\theta$ and determined by the size
of the parameter set $\Theta$. It can be seen from this that the diameter of
the confidence region satisfies

(2.3) $\mathrm{diam}(C_n) \lesssim \sqrt{\hat\tau_n} + \sqrt{R_n(\hat\theta^{(n)})} + n^{-1/2}.$

(See the proof of the proposition below for a precise argument.) The last term
on the right is the parametric rate of estimation and is typically negligible
relative to the other terms. The first term $\sqrt{\hat\tau_n}$ depends on the
supermodel and its size is typically the same on every submodel.
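The membership rule of the set (2.2) can be written out directly. The following is a minimal sketch with hypothetical function names; the scale is passed in as a function of $\theta$ to reflect that $\hat\tau_{n,\theta}$ may depend on the candidate parameter, which is why the region need not be a ball. The particular scale used in the demonstration is an invented example of the form $\hat\tau_n + \|\theta - \hat\theta^{(n)}\|/\sqrt{n}$ and is not taken from any of the models.

```python
import math


def distance(u, v):
    """Euclidean distance between two coordinate sequences of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def radius(r_n, tau, z_alpha):
    """sqrt(z_alpha * tau + r_n), with the convention sqrt(x) = 0 for x < 0."""
    return math.sqrt(max(z_alpha * tau + r_n, 0.0))


def in_confidence_set(theta, center, r_n, tau_fn, z_alpha):
    """Membership test for (2.2); tau_fn(theta) plays the role of the scale."""
    return distance(theta, center) <= radius(r_n, tau_fn(theta), z_alpha)


if __name__ == "__main__":
    center = [0.5, 0.25, 0.0]
    n = 100

    def tau_fn(theta):
        # Hypothetical scale, for illustration only.
        return 0.01 + distance(theta, center) / math.sqrt(n)

    for candidate in ([0.55, 0.25, 0.0], [5.0, 0.25, 0.0]):
        print(candidate, in_confidence_set(candidate, center, 0.01, tau_fn, 1.645))
```

A negative value of $R_n$ that outweighs $z_\alpha\hat\tau_{n,\theta}$ gives radius zero under the stated convention, so only the center itself can then belong to the set.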

The possibility of adaptation hinges on the second term. Typically (2.1)
extends to a full, two-sided comparison, of the form
$|R_n(\hat\theta^{(n)}) - \|\theta - \hat\theta^{(n)}\|^2| = O_{P_\theta}(\hat\tau_{n,\theta})$
uniformly in $\theta \in \Theta$. Then it follows that the diameter of $C_n$
is of the order, uniformly in $\theta \in \Theta$,

$\mathrm{diam}(C_n) = O_{P_\theta}\big(\sqrt{\hat\tau_n} + \|\hat\theta^{(n)} - \theta\| + n^{-1/2}\big).$

The diameter of the confidence set on a given submodel
$\Theta_1 \subset \Theta$ is bounded above by the biggest order of the
expression on the right-hand side under $\theta$, for $\theta$ ranging over
$\Theta_1$. For small submodels, or more generally submodels where the
estimators $\hat\theta^{(n)}$ perform well, the diameter will be dominated by
the term $\sqrt{\hat\tau_n}$, the rate of the estimators of
$\|\theta - \hat\theta^{(n)}\|^2$. On the other hand, in bigger submodels the
term $\|\hat\theta^{(n)} - \theta\|$ may dominate. It is thus that we achieve
adaptation to smaller models, but only up to the order $\sqrt{\hat\tau_n}$.

It is apparent from the preceding description that our confidence regions
depend crucially on good estimators of the squared distance
$\|\theta - \hat\theta^{(n)}\|^2$ of the parameter $\theta$ to the point
$\hat\theta^{(n)}$. The latter point $\hat\theta^{(n)}$ may be considered
fixed, as we condition on the initial estimator. The problem of constructing
such estimators is therefore closely connected to the problem of estimating
the squared norm $\|\theta\|^2$ of a Hilbert space-valued parameter. In some
examples this is straightforward, but in the situations of density estimation
and regression this problem is more involved. Fortunately, in the latter
cases the estimation of a "quadratic functional" has been studied in detail
by, among others, Fan [17], Bickel and Ritov [6], Laurent [26, 27] and Laurent
and Massart [28], whose work obtains additional relevance in the present
paper. The more recent papers consider adaptive estimators of the squared
norm, but for our purposes optimal estimation under the biggest model will be
sufficient. In view of their simplicity we shall adapt the constructions of
Laurent [26, 27] to our purposes, but other approaches could be used as well.

This method consists of estimating the squared norm
$\|\Pi_k\theta - \Pi_k\hat\theta^{(n)}\|^2$ of the projection of the
difference $\theta - \hat\theta^{(n)}$ [where
$\Pi_k\theta = (\theta_1, \ldots, \theta_k, 0, 0, \ldots)$] unbiasedly and
trading off the resulting (squared) bias versus the variance of the
estimator. Under the assumption that $\hat\theta^{(n)}$ takes its values in
$\Theta$, the bias is bounded by a multiple of

(2.4) $B_k^2 := \sup_{\theta\in\Theta} \|\theta - \Pi_k\theta\|^2.$

The variance turns out to be of the order, for a parameter $\sigma^2$ that
depends on the setting,

(2.5) $\hat\tau_{k,n,\theta}^2 := \frac{2\sigma^4 k}{n^2} + \frac{4\sigma^2 \|\Pi_k\theta - \Pi_k\hat\theta^{(n)}\|^2}{n}.$

The root $\hat\tau_{k,n,\theta}$ of this variance and the bias $B_k^2$ must be
incorporated into the variable $\hat\tau_{n,\theta}$ as in (2.1). We define
$\hat\tau_n = \sqrt{2}\,\sigma^2\sqrt{k}/n + B_k^2$, and conclude, in view of
(2.3), that the diameter of the resulting confidence set (2.2) is of the
order

(2.6) $\frac{k^{1/4}}{\sqrt{n}} + B_k + \|\theta - \hat\theta^{(n)}\| + \frac{1}{\sqrt{n}}.$

We can now choose an optimal value of $k$ by trading off $k^{1/4}/\sqrt{n}$
versus $B_k$.

The parameter $\sigma$ may depend on the unknown $\theta$, but in that case
must be uniformly bounded over the supermodel $\Theta$.

For later reference we formalize the preceding as a proposition. Rather
than making assumptions on bias and variance, we assume that the estimation
rate of the estimators $R_{k,n}(\hat\theta^{(n)})$ is of the order as in the
preceding discussion: for $\hat\tau_{k,n,\theta}$ as in (2.5), some number
$z_\alpha$ and any sequences $k_n \to \infty$ and $M_n \to \infty$,

(2.7) $\limsup_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\big(R_{k_n,n}(\hat\theta^{(n)}) - \|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2 < -z_\alpha \hat\tau_{k_n,n,\theta}\big) \le \alpha,$

(2.8) $\limsup_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\big(|R_{k_n,n}(\hat\theta^{(n)}) - \|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2| > M_n \hat\tau_{k_n,n,\theta}\big) = 0.$

Of course, the second equation implies that the first is satisfied for
sufficiently large $z_\alpha$, whereas an "absolute version" of the first
equation for all $\alpha \in (0, 1)$ will imply the second one.

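Proposition 2.1. Suppose that $R_{k,n}(\hat\theta^{(n)})$ are estimators
that satisfy (2.7)-(2.8) for $\hat\tau_{k,n,\theta}$ given in (2.5) and some
$\sigma \in (0, \bar\sigma]$. Assume that $\hat\theta^{(n)}$ takes its values
in $\Theta$. Then for $B_k$ given in (2.4) the sets

$C_n = \big\{\theta \in \Theta : \|\theta - \hat\theta^{(n)}\| \le \sqrt{z_\alpha \hat\tau_{k_n,n,\theta} + R_{k_n,n}(\hat\theta^{(n)})} + 2B_{k_n}\big\}$

are honest $(1 - \alpha)$-confidence sets for $\theta \in \Theta$, for any
$k_n \to \infty$, with diameter satisfying, for any $M_n \to \infty$,

$\limsup_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\Big(\mathrm{diam}(C_n) \ge M_n\Big[\frac{k_n^{1/4}}{\sqrt{n}} + B_{k_n} + \|\theta - \hat\theta^{(n)}\|\Big]\Big) = 0.$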



Proof. By (2.4) the difference $\|\theta - \hat\theta^{(n)}\|$ is bounded
above by $\|\Pi_k(\theta - \hat\theta^{(n)})\| + 2B_k$. Therefore, by the
definition of $C_n$,

$P_\theta(\theta \notin C_n) \le P_\theta\big(\|\Pi_{k_n}(\theta - \hat\theta^{(n)})\| > \sqrt{z_\alpha \hat\tau_{k_n,n,\theta} + R_{k_n,n}(\hat\theta^{(n)})}\big).$

In view of (2.7) the right-hand side is asymptotically bounded above by
$\alpha$, uniformly in $\theta \in \Theta$. Hence $C_n$ is an asymptotic
confidence region of confidence level $1 - \alpha$.

In view of the form (2.5) of $\hat\tau_{k,n,\theta}$ every element $\theta$ of
$C_n$ satisfies

$\|\theta - \hat\theta^{(n)}\| \le \sqrt{z_\alpha\sqrt{2}\,\sigma^2\sqrt{k_n}/n + R_{k_n,n}(\hat\theta^{(n)})} + 2B_{k_n} + \sqrt{2 z_\alpha\sigma}\,\|\theta - \hat\theta^{(n)}\|^{1/2}/n^{1/4}.$

The inequality $x \le B + A\sqrt{x}$ for real numbers $x$ and positive real
numbers $A$ and $B$ implies that $x \le 2B + 2A^2$. We conclude that the
diameter of $C_n$ is bounded by a multiple of

$\sqrt{\sigma^2\sqrt{k_n}/n + R_{k_n,n}(\hat\theta^{(n)})} + B_{k_n} + \frac{\sigma}{\sqrt{n}}.$

The variables $R_{k_n,n}(\hat\theta^{(n)})$ are with $P_\theta$-probability
tending to 1 bounded above by a multiple of
$\|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2 + M_n\hat\tau_{k_n,n,\theta}$,
for any given $M_n \to \infty$, by (2.8). Therefore, with probability tending
to 1 the diameter of $C_n$ is bounded by a multiple of

$\frac{\sigma k_n^{1/4}}{\sqrt{n}} + \|\theta - \hat\theta^{(n)}\| + \sqrt{M_n\hat\tau_{k_n,n,\theta}} + B_{k_n} + \frac{\sigma}{\sqrt{n}}.$

Here the last term is negligible relative to the first. The proposition
follows in view of the form (2.5) of $\hat\tau_{k,n,\theta}$ and the
inequality
$\sigma k^{1/4}/\sqrt{n} + \sqrt{\sigma}\sqrt{x}/n^{1/4} + x \le 2\sigma k^{1/4}/\sqrt{n} + 2x$,
which is valid for any $k \ge 1$, $x \ge 0$ and $\sigma > 0$. □
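For convenience, the elementary inequality relating $x$, $A$ and $B$ in the proof can be verified in one line. The derivation below is a sketch added for the reader and not part of the original argument; it in fact gives the slightly sharper constant $A^2$ in place of $2A^2$.

```latex
% Assume x \ge 0, A > 0, B > 0 and x \le B + A\sqrt{x}.
% By the arithmetic-geometric mean inequality, A\sqrt{x} \le \tfrac12 A^2 + \tfrac12 x.
\begin{align*}
x \le B + A\sqrt{x} \le B + \tfrac12 A^2 + \tfrac12 x
\quad\Longrightarrow\quad
\tfrac12 x \le B + \tfrac12 A^2
\quad\Longrightarrow\quad
x \le 2B + A^2 \le 2B + 2A^2 .
\end{align*}
```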

The natural (or "naive") estimators $R_n(\hat\theta^{(n)})$ of
$\|\theta - \hat\theta^{(n)}\|^2$ in our examples can assume negative values,
which could lead to a confidence set $C_n$ in (2.2) of zero diameter. This is
unlike the usual situation in parametric models, where $\sqrt{n}$ times the
radius of a confidence region for $\theta$ generally has the desirable
property of tending in probability to a positive constant. In practice it
might be useful to eliminate the possibility of radii of zero by substituting
for the right-hand side of (2.2) a more conservative cut-off, given by the
maximum of the current right-hand side and $\sqrt{z_\alpha\hat\tau_{n,\theta}}$
(or perhaps $\sqrt{z_\alpha\hat\tau_{n,\theta}}/2$).

Example 2.1 (Model of dimension $n$). If $\Theta = \mathbb{R}^n$, then we
can avoid a bias by choosing $k = n$. Then the diameter of the confidence
sets is of the order equal to the maximum of $n^{-1/4}$ and the estimation
error $\|\theta - \hat\theta^{(n)}\|$.

Example 2.2 (Regular models). The usual models to define regular
parameters are the ellipsoids
$S(\beta, L) = \{\theta \in \ell_2 : \sum_{i=1}^\infty \theta_i^2 i^{2\beta} \le L^2\}$,
for $\beta > 0$ and $L > 0$ given. Suppose we choose
$\Theta = S(\beta, L)$ for fixed values of $\beta$ and $L$ as the supermodel,
on which we require honesty, and consider adaptation on ellipsoids defined by
different parameter values.

If we cut off the series expansion at level $k$, then the maximal squared
bias is equal to

$\sup_{\theta\in\Theta} \sum_{i>k} \theta_i^2 \le \sup_{\theta\in\Theta} \sum_{i>k} \theta_i^2 \Big(\frac{i}{k}\Big)^{2\beta} \le \frac{L^2}{k^{2\beta}}.$

This leads to the trade-off $k^{1/4}/\sqrt{n} \sim L/k^\beta$, resulting in a
cut-off of the order

$k \sim L^{4/(4\beta+1)} n^{1/(2\beta+1/2)}$

and a bias of the order $n^{-\beta/(2\beta+1/2)} L^{1/(4\beta+1)}$.

This choice of $k$ is compatible with $k \le n$ only if $\beta \ge 1/4$. Thus
if $\Theta$ is restricted to $\mathbb{R}^n$, as in the finite sequence model,
then we consider submodels $S(\beta, L)$ with $\beta \ge 1/4$ only.

For this choice of $k$ we obtain a confidence region for the full parameter
$\theta \in \ell_2$ of diameter of order equal to the maximum of
$n^{-\beta/(2\beta+1/2)}$ and the estimation error
$\|\theta - \hat\theta^{(n)}\|$. The lower bound $n^{-\beta/(2\beta+1/2)}$ and
the cut-off $k$ are for some values of $\beta$ given in the third and fourth
columns of Table 1.

Thus the role of the minimal diameter $n^{-1/4}$ in the preceding example is
now taken over by $n^{-\beta/(2\beta+1/2)}$.
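The trade-off that fixes the cut-off can be checked numerically. The sketch below, with hypothetical function names, evaluates the closed-form cut-off and confirms that the variance-type term $k^{1/4}/\sqrt{n}$ and the bias bound $L/k^\beta$ coincide there; constants are ignored, as in the text, so the result should be read as an order of magnitude and not as a recommended integer cut-off.

```python
import math


def cutoff(n, beta, L=1.0):
    """Closed-form cut-off L^(4/(4 beta + 1)) * n^(1/(2 beta + 1/2))."""
    return L ** (4.0 / (4.0 * beta + 1.0)) * n ** (1.0 / (2.0 * beta + 0.5))


def variance_term(k, n):
    """The term k^(1/4) / sqrt(n) in the diameter."""
    return k ** 0.25 / math.sqrt(n)


def bias_bound(k, beta, L=1.0):
    """The bound L / k^beta on the bias B_k over the ellipsoid S(beta, L)."""
    return L / k ** beta


if __name__ == "__main__":
    n = 10 ** 6
    for beta in (1.0, 0.5, 0.25):
        k = cutoff(n, beta, L=2.0)
        print(f"beta={beta}: k={k:.1f}, "
              f"variance term={variance_term(k, n):.5f}, "
              f"bias bound={bias_bound(k, beta, L=2.0):.5f}")
```

With $L = 1$ and $\beta = 1/4$ the cut-off equals $n$, which is the boundary case for the finite sequence model.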

For the initial estimators $\hat\theta^{(n)}$ there is a variety of choices.
A relatively simple scheme is to choose $\hat\theta^{(n)}$ to adapt to all
regularity classes $S(\gamma, M)$ in the sense that, for all
$\gamma \ge \beta$ and all $M > 0$, for some constants $C_{\gamma,M}$,

$\sup_{\theta\in S(\gamma,M)} E_\theta\|\hat\theta^{(n)} - \theta\|^2 \le C_{\gamma,M}\, n^{-2\gamma/(2\gamma+1)}.$

Such estimators exist in the examples considered in the Introduction. In fact,
there exist estimators that adapt to a much larger collection of submodels
than only the Sobolev models considered in this paper. Combined with our
construction this will lead to a confidence region around $\hat\theta^{(n)}$
of diameter of the order $n^{-\gamma/(2\gamma+1)}$ uniformly over
$S(\gamma, M)$ if $\gamma \in [\beta, 2\beta]$, and of the order
$n^{-\beta/(2\beta+1/2)}$ over $S(\gamma, M)$ for other indices $\gamma$.

3. Sequence models. Suppose that we observe an infinite sequence $X = (X_1, X_2, \ldots)$
of independent random variables $X_i$ possessing means $\mathrm{E}X_i = \theta_i$
and variances $\sigma^2/n$. The parameter $\theta = (\theta_1, \theta_2, \ldots)$ is known to belong to a
subset $\Theta$ of $\ell_2$. This formulation encompasses both the finite and the infinite
sequence models of Examples 1.2 and 1.3, if in the former case it is understood
that $\Theta \subset \mathbb{R}^n := \{\theta \in \ell_2 : \theta_i = 0,\ i > n\}$ and that $X_{n+1}, X_{n+2}, \ldots$ may
not be used to estimate $\sigma^2$. Our main interest is in the case where the $X_i$ are
also normally distributed, but we also consider the more general situation.
The assumption of normality allows a precise and simple derivation of the
radius of a confidence region. In a final subsection we also indicate how to
obtain confidence sets with guaranteed level for finite $n$.

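3.1. Normal distributions. In this section we assume in addition to the
preceding that each $X_i$ is normally distributed.

Given an initial estimator $\hat\theta^{(n)}$, based on observations that are independent
of $X$, our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by

$$R_{k,n}(\hat\theta^{(n)}) = \sum_{i=1}^{k} \bigl(X_i - \hat\theta_i^{(n)}\bigr)^2 - \frac{k\sigma^2}{n}. \tag{3.1}$$

Here $k = k_n$ is chosen dependent on $\Theta$ and/or $\Theta_1$, where we must have $k \le n$
in the finite sequence model. This estimator is combined with the estimator
of variance (random only in its dependence on $\hat\theta^{(n)}$)

$$\hat\tau^2_{k,n,\theta} = \frac{2k\sigma^4}{n^2} + \frac{4\sigma^2}{n} \sum_{i=1}^{k} \bigl(\theta_i - \hat\theta_i^{(n)}\bigr)^2. \tag{3.2}$$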






We shall show that $R_{k,n}(\hat\theta^{(n)})$, after the centering and scaling of Theorem 3.1,
tends in distribution to a normal distribution, uniformly in $\theta \in \ell_2$. This allows us
to construct confidence sets of the type (2.2) by using normal quantiles for the
values $z_\alpha$. [Because $R_{k,n}$ and $\hat\tau_{k,n,\theta}$ depend in fact only on $(\theta_1, \ldots, \theta_k)$,
"uniformly in $\theta \in \ell_2$" means effectively "uniformly in $(\theta_1, \ldots, \theta_k) \in \mathbb{R}^k$."]
Because $R_{k,n}(\hat\theta^{(n)})$ is a sum of independent variables, its asymptotic normality
is not a surprise. The main contribution of the following theorem is that this
asymptotic normality is uniform in $\theta$, without any conditions on the initial
estimators $\hat\theta^{(n)}$. This depends crucially on the normality of the observations.

The convergence in the following theorem may be understood in the almost
sure sense. As the proof shows, the weak convergence is actually uniform
in the values $\hat\theta^{(n)}$.



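Theorem 3.1. For any $k_n \to \infty$ as $n \to \infty$,

$$\sup_{\theta \in \ell_2}\ \sup_{x \in \mathbb{R}} \left| P_\theta\!\left( \frac{R_{k_n,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k_n} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k_n,n,\theta}} \le x \,\Big|\, \hat\theta^{(n)} \right) - \Phi(x) \right| \to 0.$$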



Proof. We can express the variable $\bigl(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2\bigr)/\hat\tau_{k,n,\theta}$
in the independent standard normal variables $\varepsilon_i$ defined by $X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$,
as

$$\frac{1}{\sqrt{2k}} \sum_{i=1}^{k} (\varepsilon_i^2 - 1)\, A_{k,n}(\theta) + \frac{\sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})\varepsilon_i}{\sqrt{\sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}}\, B_{k,n}(\theta),$$

for the positive constants whose squares are given by

$$A^2_{k,n}(\theta) = \frac{1}{1 + (2n/(k\sigma^2)) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}, \qquad
B^2_{k,n}(\theta) = \frac{\sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}{k\sigma^2/(2n) + \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}.$$

By the rotational invariance of the multivariate standard normal distribution,
for any vector $\psi$ with norm 1 the random vector
$\bigl((2k)^{-1/2} \sum_{i=1}^{k} (\varepsilon_i^2 - 1),\ \sum_{i=1}^{k} \psi_i \varepsilon_i\bigr)$ is equal in distribution to the random vector
$\bigl((2k)^{-1/2} \sum_{i=1}^{k} (\varepsilon_i^2 - 1),\ k^{-1/2} \sum_{i=1}^{k} \varepsilon_i\bigr)$. The latter vector tends in distribution to a vector of two
independent standard normal variables, as $k \to \infty$. The coefficients $A_{k,n}(\theta)$
and $B_{k,n}(\theta)$ are contained in the unit interval and satisfy $A^2_{k,n}(\theta) + B^2_{k,n}(\theta) = 1$,
for any $k, n, \theta$.

We can complete the proof by noting that if a sequence of random vectors
$(X_n, Y_n)$ converges in distribution to a random vector $(X, Y)$, then the
sequence $AX_n + BY_n$ tends in distribution to $AX + BY$, uniformly in coefficients
$(A, B)$ belonging to a compact set. □
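The uniform normal approximation can be explored by simulation. The sketch below draws Gaussian observations, forms the statistic (3.1), standardizes it with (3.2), and reports the sample mean, sample variance and the fraction of draws within the two-sided normal 95% bounds. The parameter vector, the (deliberately poor) initial estimator and the function names are hypothetical choices made only for illustration.

```python
import math
import random

def standardized_statistic(theta, theta_hat, sigma, n, rng):
    """Draw X_i = theta_i + (sigma / sqrt(n)) * eps_i with standard normal eps_i and
    return (R_{k,n} - sum_i (theta_i - theta_hat_i)^2) / tau_hat, following (3.1)-(3.2)."""
    k = len(theta)
    scale = sigma / math.sqrt(n)
    x = [t + scale * rng.gauss(0.0, 1.0) for t in theta]
    dist2 = sum((t - th) ** 2 for t, th in zip(theta, theta_hat))
    r = sum((xi - th) ** 2 for xi, th in zip(x, theta_hat)) - k * sigma ** 2 / n
    tau2 = 2.0 * k * sigma ** 4 / n ** 2 + 4.0 * sigma ** 2 / n * dist2
    return (r - dist2) / math.sqrt(tau2)

def simulate(k=50, n=100, sigma=1.0, reps=4000, seed=0):
    """Return (mean, variance, coverage of [-1.96, 1.96]) over Monte Carlo draws."""
    rng = random.Random(seed)
    theta = [1.0 / (i + 1) for i in range(k)]  # hypothetical true parameter
    theta_hat = [0.0] * k                      # hypothetical initial estimator
    draws = [standardized_statistic(theta, theta_hat, sigma, n, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    frac = sum(abs(d) <= 1.96 for d in draws) / reps
    return mean, var, frac

if __name__ == "__main__":
    print(simulate())
```

The standardized statistic has mean 0 and variance 1 exactly for every $k$ and $n$, so only the coverage fraction reflects the quality of the normal approximation.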

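The theorem shows that $R_{k,n}(\hat\theta^{(n)})$ is a good estimator of the squared
norm of the projection $\Pi_k(\theta - \hat\theta^{(n)})$ of $\theta - \hat\theta^{(n)}$ onto the $k$-dimensional
subspace $\{\theta \in \ell_2 : \theta_i = 0,\ i > k\}$, and justifies (2.8) with $\hat\tau^2_{k,n,\theta}$ of the order as
in (2.5). Thus Proposition 2.1 yields a confidence region of diameter of the
order

$$\frac{\sigma k_n^{1/4}}{\sqrt{n}} + B_{k_n} + \|\theta - \hat\theta^{(n)}\|.$$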

Example 3.1 (Finite sequence model). In the finite sequence model of
Example 1.2 with $\Theta = \mathbb{R}^n$, we have bias $B_k$ zero if we choose $k = n$. This
leads to confidence sets of diameter of the order equal to the maximum of
$n^{-1/4}$ and the estimation error $\|\theta - \hat\theta^{(n)}\|$.

As was shown by Li [32] and Baraud [1] the $n^{-1/4}$ lower bound cannot be
improved upon without losing full honesty.

We can influence the term $\|\theta - \hat\theta^{(n)}\|$ by choosing our favorite estimators
$\hat\theta^{(n)}$. For instance, we may choose any of the adaptive penalized minimum
contrast estimators considered in [8]. As shown by Birgé and Massart [8]
we can adapt to large classes of a priori models by choosing appropriate
penalties.

One choice of penalties leads to estimators that, among other good
properties, satisfy, for every $D$,

$$\sup_{\theta \in \Theta_D} \mathrm{E}_\theta \|\hat\theta^{(n)} - \theta\|^2 \lesssim \frac{\sigma^2}{n} \Bigl( D + \log\frac{2n}{D} + 1 \Bigr),$$

where $\Theta_D = \{\theta \in \mathbb{R}^n : \#(\theta_i \ne 0) \le D\}$. The confidence sets centered at these
estimators attain a uniform order equal to the maximum of $n^{-1/4}$ and
$\sqrt{D/n} + \sqrt{\log(2n/D)/n}$. As long as $D \ll n$ this improves upon the order
1 rate attained by the naive chi-square procedure, and we obtain the best
possible rate $n^{-1/4}$ uniformly over every set $\Theta_D$ with $D \lesssim \sqrt{n}$. Thus these
excellent adaptation properties of $\hat\theta^{(n)}$ result in smaller confidence regions,
for more submodels, than those found in [1], pages 533-536, by a direct
construction.

The estimators $R_{k,n}(\hat\theta^{(n)})$ and $\hat\tau_{k,n,\theta}$ in the preceding theorem depend on
$\sigma^2$ and hence so far we have implicitly assumed that (an upper bound on)
the variance $\sigma^2$ is known. The preceding remains true if it is replaced by a
good estimator.

Theorem 3.2. The assertion of Theorem 3.1 remains true if $\sigma^2$ in the
definitions (3.1) and (3.2) of $R_{k,n}$ and $\hat\tau_{k,n,\theta}$ is replaced by estimators $\hat\sigma^2$
such that

$$\sup_{\theta \in \Theta} P_\theta\bigl( \sqrt{k_n}\, |\hat\sigma^2 - \sigma^2| > \varepsilon \,\big|\, \hat\theta^{(n)} \bigr) \to 0.$$

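Proof. Represent the observations as $X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$ for independent
standard normal variables $\varepsilon_i$. It suffices to prove the uniform asymptotic
normality of the variables

$$\frac{\sum_{i=1}^{k} (\varepsilon_i^2 - 1)\sigma^2/n + k(\sigma^2/n - \hat\sigma^2/n) + (2\sigma/\sqrt{n}) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})\varepsilon_i}{\sqrt{(2\hat\sigma^4 k/n^2) + (4\hat\sigma^2/n) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}}.$$

Therefore, it suffices to prove that, uniformly in $\theta \in \Theta$,

$$\frac{2\hat\sigma^4/n^2 + (4\hat\sigma^2/(nk)) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}{2\sigma^4/n^2 + (4\sigma^2/(nk)) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2} - 1 \xrightarrow{P} 0, \tag{3.3}$$

$$\frac{\sqrt{k}\,(\hat\sigma^2/n - \sigma^2/n)}{\sqrt{2\sigma^4/n^2 + (4\sigma^2/(nk)) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}} \xrightarrow{P} 0. \tag{3.4}$$

The absolute value of the left-hand side of (3.3) can be rewritten in the form,
for the constants $C_{n,k}(\theta) = (2n/(k\sigma^2)) \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2$,

$$\left| \frac{\hat\sigma^2}{\sigma^2} - 1 \right| \frac{\hat\sigma^2/\sigma^2 + 1 + C_{n,k}(\theta)}{1 + C_{n,k}(\theta)} \le \left| \frac{\hat\sigma^2}{\sigma^2} - 1 \right| \left( \frac{\hat\sigma^2}{\sigma^2} + 1 \right).$$

Thus this reduces to $\hat\sigma/\sigma \xrightarrow{P} 1$, uniformly in $\theta$. Assertion (3.4) is true as soon
as $\sqrt{k}\,(\hat\sigma^2 - \sigma^2) \xrightarrow{P} 0$, uniformly in $\theta$. □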



In the finite sequence model with $\Theta = \mathbb{R}^n$ there is no possibility to estimate
$\sigma^2$, and the same is true in the infinite sequence model without some
restriction on the parameter set $\Theta$. On the other hand, in the infinite sequence
model with a restriction to regular parameters, estimation of $\sigma^2$ is
easy.

Example 3.2 (Regular models). For given integers $l$ and $m$ consider
the estimator $\hat\sigma^2 = (n/l) \sum_{i=m+1}^{m+l} X_i^2$. This has mean and variance given by

$$\mathrm{E}_\theta \hat\sigma^2 = \frac{n}{l} \sum_{i=m+1}^{m+l} \Bigl( \theta_i^2 + \frac{\sigma^2}{n} \Bigr) = \sigma^2 + \frac{n}{l} \sum_{i=m+1}^{m+l} \theta_i^2,$$

$$\operatorname{var}_\theta \hat\sigma^2 = \frac{n^2}{l^2} \sum_{i=m+1}^{m+l} \Bigl( \frac{4\sigma^2 \theta_i^2}{n} + \frac{2\sigma^4}{n^2} \Bigr) = \frac{4n\sigma^2}{l^2} \sum_{i=m+1}^{m+l} \theta_i^2 + \frac{2\sigma^4}{l}.$$

It follows that the mean squared error over the regularity class $S(\beta, L)$ can
be bounded as

$$\mathrm{E}_\theta (\hat\sigma^2 - \sigma^2)^2 \lesssim \frac{n^2}{l^2}\frac{1}{m^{4\beta}} + \frac{n}{l^2}\frac{1}{m^{2\beta}} + \frac{1}{l}.$$

In view of Theorem 3.2 we wish this to be of smaller order than $1/k$.

In the infinite sequence model with $\Theta = S(\beta, L)$ as the biggest model,
we choose $k = n^{1/(2\beta+1/2)}$ (cf. Example 2.2), and hence we must choose
$l \gg n^{1/(2\beta+1/2)}$. These values are shown for some values of $\beta$ in Table 1. For the
minimal value of $l$ we must choose $m \ge n^{1/(2\beta+1/2)}$ and then the estimator
for $\sigma^2$ becomes independent of $R_{k,n}(\hat\theta^{(n)})$. A variety of other combinations
of $(m, l)$ will do as well.

In the finite sequence model with $\Theta$ restricted to $S(\beta, L)$, truncated to
$\mathbb{R}^n$, the choice $l \ge n^{1/(2\beta+1/2)}$ can be realized with $l \le n$ only if $\beta \ge 1/4$. We
can then combine it with $m$ of the order $n^{1/(2\beta+1/2)}$.
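The mean and variance of the tail-based variance estimator can be evaluated directly. The following sketch computes both for a given block of tail coefficients under normal observations; the function name `sigma2_hat_moments` and its argument layout are assumptions made for illustration.

```python
def sigma2_hat_moments(theta_tail, sigma, n):
    """Mean and variance of sigma2_hat = (n / l) * sum of X_i**2 over a block of
    l coordinates, for independent normal X_i with means theta_tail[i] and
    common variance sigma**2 / n."""
    l = len(theta_tail)
    s2 = sum(t * t for t in theta_tail)
    mean = sigma ** 2 + n / l * s2
    var = 4.0 * n * sigma ** 2 / l ** 2 * s2 + 2.0 * sigma ** 4 / l
    return mean, var

def mean_squared_error(theta_tail, sigma, n):
    """Squared bias plus variance of sigma2_hat as an estimator of sigma**2."""
    mean, var = sigma2_hat_moments(theta_tail, sigma, n)
    return (mean - sigma ** 2) ** 2 + var
```

When the tail coefficients vanish the estimator is unbiased and its variance reduces to $2\sigma^4/l$, which is why $l$ must be of larger order than $k$.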



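3.2. Nonnormal distributions. The assumed normality of the observations
$X_1, X_2, \ldots$ in the preceding section helps one to obtain precise critical
values, but it is not important for the general ideas. In this section let
$X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$ for an i.i.d. sequence $\varepsilon_1, \varepsilon_2, \ldots$ with mean zero, variance
1 and finite fourth moment. Then define $R_{k,n}(\hat\theta^{(n)})$ as in (3.1) and define
the variance estimator

$$\hat\tau^2_{k,n,\theta} = \frac{k\sigma^4}{n^2} \operatorname{var}(\varepsilon_1^2) + \frac{4\sigma^2}{n} \sum_{i=1}^{k} \bigl(\theta_i - \hat\theta_i^{(n)}\bigr)^2 + \frac{4\sigma^3\, \mathrm{E}\varepsilon_1^3}{n\sqrt{n}} \sum_{i=1}^{k} \bigl(\theta_i - \hat\theta_i^{(n)}\bigr).$$

Theorem 3.3. For any $k$ and $n$,

$$\inf_{\theta \in \ell_2} P_\theta\!\left( \left| \frac{R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k,n,\theta}} \right| \le \frac{1}{\sqrt{\alpha}} \,\Big|\, \hat\theta^{(n)} \right) \ge 1 - \alpha.$$

Proof. The quantity $R_{k,n}(\hat\theta^{(n)})$ is an unbiased estimator of
$\sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2$, and $\hat\tau^2_{k,n,\theta}$ is equal to its variance. Therefore, the inequality follows
by Chebyshev's inequality. □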



The preceding theorem is based on Chebyshev's inequality, which is notably
imprecise. However, this crude device costs only in terms of the constants
and not in terms of the rate. If $Z$ is exactly standard normal distributed,
then we have that $P(|Z| > 1.96) = 0.05$, whereas the use of Chebyshev's
inequality $P(|Z| > M) \le M^{-2}$ would replace the normal quantile 1.96 by
$M = 0.05^{-1/2} \approx 4.5$, so that the resulting confidence set would be a bit more
than two times too wide.
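The comparison between the two critical values is easy to reproduce. This small sketch uses only the Python standard library; the helper names are illustrative.

```python
from statistics import NormalDist

def chebyshev_critical_value(alpha: float) -> float:
    """Value M with M**-2 == alpha, so that Chebyshev gives P(|Z| > M) <= alpha."""
    return alpha ** -0.5

def normal_critical_value(alpha: float) -> float:
    """Two-sided standard normal critical value z with P(|Z| > z) = alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

alpha = 0.05
m = chebyshev_critical_value(alpha)  # about 4.47
z = normal_critical_value(alpha)     # about 1.96
ratio = m / z                        # a bit more than two
```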
For many estimators $\hat\theta^{(n)}$ we can avoid this penalty, because the quantities
$R_{k,n}(\hat\theta^{(n)})$ will be asymptotically normal, at least under the overall probability
law governing the initial estimators $\hat\theta^{(n)}$ and the observations $X^{(n)}$.
This will depend on the initial estimators $\hat\theta^{(n)}$, but the following assumption
appears to be reasonable. Assume that the initial estimators satisfy, for
some sequence $\epsilon_n \to 0$,

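$$\sup_{\theta \in \Theta} P_\theta\!\left( \max_{1 \le i \le k_n} \bigl| \theta_i - \hat\theta_i^{(n)} \bigr|^2 > \epsilon_n \Bigl( \frac{k_n}{n} + \sum_{i=1}^{k_n} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2 \Bigr) \right) \to 0. \tag{3.5}$$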



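Theorem 3.4. For any $k_n \to \infty$ as $n \to \infty$ such that (3.5) holds,

$$\sup_{\theta \in \Theta}\ \sup_{x \in \mathbb{R}} \left| P_\theta\!\left( \frac{R_{k_n,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k_n} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k_n,n,\theta}} \le x \right) - \Phi(x) \right| \to 0.$$

Proof. We can express the variable $\bigl(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2\bigr)/\hat\tau_{k,n,\theta}$
in the form

$$\sum_{i=1}^{k} A_{k,n}(\theta)(\varepsilon_i^2 - 1) + \sum_{i=1}^{k} B_{i,k,n}(\theta)\,\varepsilon_i, \tag{3.6}$$

for the positive constants given by

$$A_{k,n}(\theta) = \frac{\sigma^2}{n\, \hat\tau_{k,n,\theta}}, \qquad B_{i,k,n}(\theta) = \frac{2\sigma (\theta_i - \hat\theta_i^{(n)})}{\sqrt{n}\, \hat\tau_{k,n,\theta}}.$$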

The terms in the sum (3.6) are conditionally independent under given
$\hat\theta^{(n)}$, and the sum has conditional mean and variance equal to 0 and 1. If
the terms of the sum also satisfy the conditional Lindeberg condition in
probability, then the variables (3.6) converge conditionally in distribution,
in probability. We wish to show that this is true uniformly in $\theta \in \Theta$.

Thus it suffices to prove that for every $k_n \to \infty$, every $\delta > 0$ and any
sequence $\{\theta_n\} \subset \Theta$, as $n \to \infty$,

$$\sum_{i=1}^{k_n} \mathrm{E}_{\theta_n}\Bigl[ \bigl( A_{k_n,n}(\theta_n)(\varepsilon_i^2 - 1) + B_{i,k_n,n}(\theta_n)\varepsilon_i \bigr)^2\, 1\bigl\{ | A_{k_n,n}(\theta_n)(\varepsilon_i^2 - 1) + B_{i,k_n,n}(\theta_n)\varepsilon_i | > \delta \bigr\} \,\Big|\, \hat\theta^{(n)} \Bigr] \xrightarrow{P} 0.$$

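For any $c \in [0, 1)$ and positive numbers $A$, $B$ we have that $(1 - c)(A^2 + B^2) \le
A^2 + B^2 - 2cAB$. Because the correlation $c$ between $\varepsilon_1$ and $\varepsilon_1^2$ is nonnegative
and strictly smaller than 1, this inequality can be used to see that

$$\hat\tau^2_{k,n,\theta} \gtrsim \frac{k}{n^2} + \frac{1}{n} \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2.$$

Consequently,

$$\max_{1 \le i \le k} \bigl( A^2_{k,n}(\theta) + B^2_{i,k,n}(\theta) \bigr) \lesssim \frac{1}{k} + \frac{\max_{1 \le i \le k} (\theta_i - \hat\theta_i^{(n)})^2}{k/n + \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}.$$

By assumption the right-hand side converges to zero in probability, as $k =
k_n \to \infty$ and $n \to \infty$. We also have that $\sum_{i=1}^{k} \bigl( A^2_{k,n}(\theta) + B^2_{i,k,n}(\theta) \bigr)$ is uniformly
bounded. We can conclude that the Lindeberg condition is satisfied. □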

3.3. Exact simulation. The procedures in the preceding section can be
implemented as soon as the lower-order moments of the errors $\varepsilon_i$ are known
(or can be estimated). If the full distribution of the errors is available, then
we may also obtain exact, finite-sample confidence regions. This observation
is even of interest in the case of Gaussian errors.

The variable $\bigl(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2\bigr)/\hat\tau_{k,n,\theta}$ can be written as a
function $S_{k,n}(\varepsilon_1, \ldots, \varepsilon_n, \theta, \hat\theta^{(n)})$, as in (3.6) in the proof of Theorem 3.4. This
representation allows simulation of the distribution of the given variable
under $\theta$, for every fixed $\theta \in \Theta$. Thus in principle we can find the $\alpha$-quantile
$-z_\alpha(\theta)$ of this distribution, for every $\theta$. Then $C_n$ given in Proposition 2.1,
but with $z_\alpha$ replaced by $z_\alpha(\theta)$, is a valid $(1 - \alpha)$-confidence region.

Under the conditions of Theorem 3.4 the quantiles $z_\alpha(\theta)$ converge to Gaussian
quantiles, uniformly in $\theta$.
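The simulation idea can be sketched in a few lines. The code below draws errors from a user-supplied sampler, evaluates the representation (3.6) for a fixed $\theta$, and returns the negative of the empirical $\alpha$-quantile as an approximation to $z_\alpha(\theta)$. The function name, the sampler interface `draw_error(rng)` and the moment arguments `mu3` and `mu4` (third and fourth moments of the errors, used to standardize by the exact variance) are assumptions of this sketch; the defaults correspond to Gaussian errors.

```python
import math
import random

def simulated_z(theta, theta_hat, sigma, n, alpha, draw_error,
                mu3=0.0, mu4=3.0, reps=5000, seed=0):
    """Monte Carlo approximation of z_alpha(theta): minus the alpha-quantile of
    sum_i [a (eps_i^2 - 1) + b_i eps_i] / tau for i.i.d. errors with mean 0, variance 1."""
    rng = random.Random(seed)
    k = len(theta)
    a = sigma ** 2 / n
    b = [2.0 * sigma * (t - th) / math.sqrt(n) for t, th in zip(theta, theta_hat)]
    # Exact variance of the centered sum, from the error moments (assumed known).
    tau = math.sqrt(k * a * a * (mu4 - 1.0) + sum(x * x for x in b) + 2.0 * a * mu3 * sum(b))
    draws = []
    for _ in range(reps):
        total = 0.0
        for bi in b:
            e = draw_error(rng)
            total += a * (e * e - 1.0) + bi * e
        draws.append(total / tau)
    draws.sort()
    return -draws[int(alpha * reps)]

if __name__ == "__main__":
    theta = [1.0 / (i + 1) for i in range(50)]  # hypothetical parameter value
    z = simulated_z(theta, [0.0] * 50, 1.0, 100, 0.05, lambda rng: rng.gauss(0.0, 1.0))
    print(z)  # should be close to the Gaussian quantile for moderate k
```

In practice this would have to be repeated over a grid of $\theta$ values, which is why the text says the quantile can be found "in principle".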

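4. Density estimation. Suppose that we observe an i.i.d. sample $X_1, \ldots, X_n$
from a density $f$ relative to some measure $\mu$ on a measurable space $(\mathcal{X}, \mathcal{A})$.
Let $\theta = (\theta_1, \theta_2, \ldots)$ be the Fourier coefficients of $f$ relative to a given
orthonormal basis $e_1, e_2, \ldots$ of $L_2(\mathcal{X}, \mathcal{A}, \mu)$, and let $\Theta$ correspond to the collection of all
densities deemed possible. Assume that the densities $\theta \in \Theta$ are uniformly
bounded.

Given an initial estimator $\hat\theta^{(n)}$ our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by

$$R_{k,n}(\hat\theta^{(n)}) = \frac{1}{n(n-1)} \mathop{\sum\sum}_{r \ne s} \sum_{i=1}^{k} \bigl( e_i(X_r) - \hat\theta_i^{(n)} \bigr) \bigl( e_i(X_s) - \hat\theta_i^{(n)} \bigr).$$

Here $k = k_n$ is chosen dependent on $\Theta$. We combine this with the variance
estimator

$$\hat\tau^2_{k,n,\theta} = \frac{2k \|f\|_\infty^2}{n(n-1)} + \frac{4\|f\|_\infty}{n} \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2.$$

Theorem 4.1. For any $k, n$,

$$\sup_{\theta \in \Theta} \mathrm{E}_\theta \left( \frac{R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k,n,\theta}} \right)^2 \le 1.$$

Proof. The estimator $R_{k,n}(\hat\theta^{(n)})$ is a U-statistic of order 2 with kernel
$h(x, y) = \sum_{i=1}^{k} (e_i(x) - \hat\theta_i^{(n)})(e_i(y) - \hat\theta_i^{(n)})$. Its mean is equal to

$$\mathrm{E}h(X_1, X_2) = \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2.$$

Its Hoeffding decomposition (e.g., [39], Section 11.4) is

$$R_{k,n}(\hat\theta^{(n)}) = \mathrm{E}h(X_1, X_2) + \frac{1}{n} \sum_{r=1}^{n} P_1 h(X_r) + \frac{1}{n(n-1)} \mathop{\sum\sum}_{r \ne s} P_{1,2} h(X_r, X_s), \tag{4.1}$$

for the "kernel functions" given by

$$P_1 h(x) = 2 \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr) \bigl( e_i(x) - \theta_i \bigr),$$

$$P_{1,2} h(x, y) = \sum_{i=1}^{k} \bigl( e_i(x) - \theta_i \bigr) \bigl( e_i(y) - \theta_i \bigr)
= \sum_{i=1}^{k} e_i(x) e_i(y) - \sum_{i=1}^{k} \theta_i \bigl( e_i(x) + e_i(y) \bigr) + \sum_{i=1}^{k} \theta_i^2.$$

The three terms of the Hoeffding decomposition and also each of the individual
terms in its sums are uncorrelated. Furthermore, the variance of the
last term in (4.1) is equal to $2/(n(n-1)) \operatorname{var} P_{1,2} h(X_1, X_2)$.
The variance of a factor in the linear term can be bounded as

$$\operatorname{var}\bigl( P_1 h(X_1) \bigr) \le 4\, \mathrm{E}\Bigl( \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)}) e_i(X_1) \Bigr)^2
\le 4 \|f\|_\infty \int \Bigl( \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)}) e_i \Bigr)^2 d\mu = 4 \|f\|_\infty \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2,$$

by the orthonormality of the functions $e_i$ in $L_2(\mu)$.

The variables $\sum_{i=1}^{k} (e_i(X_1) - \theta_i)(e_i(X_2) - \theta_i)$ and $\sum_{i=1}^{k} \theta_i (e_i(X_1) + e_i(X_2))$
are uncorrelated and their sum is $\sum_{i=1}^{k} e_i(X_1) e_i(X_2) + \sum_{i=1}^{k} \theta_i^2$. It follows
that

$$\operatorname{var}\bigl( P_{1,2} h(X_1, X_2) \bigr) = \operatorname{var} \sum_{i=1}^{k} e_i(X_1) e_i(X_2) - \operatorname{var} \sum_{i=1}^{k} \theta_i \bigl( e_i(X_1) + e_i(X_2) \bigr).$$

This becomes bigger if we leave out the second variance on the right and
replace the first variance on the right by the second moment
$\mathrm{E}\bigl( \sum_{i=1}^{k} e_i(X_1) e_i(X_2) \bigr)^2$, which can be bounded by

$$\|f\|_\infty^2 \int\!\!\int \Bigl( \sum_{i=1}^{k} e_i(x) e_i(y) \Bigr)^2 d\mu(x)\, d\mu(y) = k \|f\|_\infty^2,$$

by the orthonormality of the functions $e_i$ in $L_2(\mu)$. □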
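By Markov's inequality, if we choose $z_\alpha = \sqrt{1/\alpha}$, then

$$\sup_{\theta \in \Theta} P_\theta\Bigl( \Bigl| R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2 \Bigr| \ge z_\alpha \hat\tau_{k,n,\theta} \Bigr) \le \alpha.$$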




The present variance $\hat\tau^2_{k,n,\theta}$ has exactly the same form as in Section 3, with
$\|f\|_\infty$ playing the role of $\sigma^2$. For more precision we can express $\|f\|_\infty$ in $\theta$,
and it is not necessary to know a uniform bound on the regression functions.
The approximation (2.8) with $\hat\tau^2_{k,n,\theta}$ of the order as in (2.5) is again satisfied
and Proposition 2.1 yields a confidence region of diameter of the order, with
$M$ a uniform bound on $\Theta$,

$$\frac{\sqrt{M}\, k_n^{1/4}}{\sqrt{n}} + B_{k_n} + \|\theta - \hat\theta^{(n)}\|.$$

The corollaries for, for example, regular models are the same.

Depending on the basis functions $e_i$, the resulting confidence region can be
tightened by using higher moments or exponential bounds. Finding an exact
limit distribution appears to be not straightforward. Existing limit results
for U-statistics with changing kernels (e.g. [33]) are based on approximation
of the kernel by a finite product kernel of fixed dimension. In our case the
kernel is already in product form, but the increase in its dimension $k$ is
essential.

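5. Random regression. Suppose that we observe an i.i.d. sample $(X_1, Y_1),
\ldots, (X_n, Y_n)$ from the distribution of a vector $(X, Y)$ described structurally
as $Y = f(X) + \varepsilon$, for $(X, \varepsilon)$ a random vector with $\mathrm{E}(\varepsilon \mid X) = 0$ and $\sigma^2(x) =
\mathrm{E}(\varepsilon^2 \mid X = x)$ admitting a bounded version. The distribution $P_X$ of $X$ is
known and $\theta_1, \theta_2, \ldots$ are the Fourier coefficients of the regression function $f$
relative to a given orthonormal basis $e_1, e_2, \ldots$ of $L_2(P_X)$. We assume that
the set of regression functions is uniformly bounded.

Given an initial estimator $\hat\theta^{(n)}$ our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by

$$R_{k,n}(\hat\theta^{(n)}) = \frac{1}{n(n-1)} \mathop{\sum\sum}_{r \ne s} \sum_{i=1}^{k} \bigl( Y_r e_i(X_r) - \hat\theta_i^{(n)} \bigr) \bigl( Y_s e_i(X_s) - \hat\theta_i^{(n)} \bigr).$$

Here $k = k_n$ is chosen dependent on $\Theta$. We combine this with the variance
estimator

$$\hat\tau^2_{k,n,\theta} = \frac{2k \bigl( \|f\|_\infty^2 + \|\sigma^2\|_\infty \bigr)^2}{n(n-1)} + \frac{4\|f\|_\infty^2 + 4\|\sigma^2\|_\infty}{n} \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2.$$

Theorem 5.1. For any $k, n$,

$$\sup_{\theta \in \Theta} \mathrm{E}_\theta \left( \frac{R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k,n,\theta}} \right)^2 \le 1.$$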

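Proof. The proof is similar to the proof of Theorem 4.1. The variable
$R_{k,n}(\hat\theta^{(n)})$ is again a U-statistic of order 2. It has mean $\sum_{i=1}^{k} (\theta_i - \hat\theta_i^{(n)})^2$ and
Hoeffding decomposition [cf. (4.1), but replace $X_r$ by $(X_r, Y_r)$] with kernels
of the form

$$P_1 h(x, y) = 2 \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr) \bigl( y e_i(x) - \theta_i \bigr),$$

$$P_{1,2} h(x_1, y_1, x_2, y_2) = \sum_{i=1}^{k} \bigl( y_1 e_i(x_1) - \theta_i \bigr) \bigl( y_2 e_i(x_2) - \theta_i \bigr)
= \sum_{i=1}^{k} y_1 y_2 e_i(x_1) e_i(x_2) - \sum_{i=1}^{k} \theta_i \bigl( y_1 e_i(x_1) + y_2 e_i(x_2) \bigr) + \sum_{i=1}^{k} \theta_i^2.$$

By the orthonormality of the functions $e_i$ and arguments as in the proof of
Theorem 4.1,

$$\operatorname{var} P_1 h(X, Y) \le 4 \bigl\| \mathrm{E}(Y^2 \mid X) \bigr\|_\infty \sum_{i=1}^{k} \bigl( \theta_i - \hat\theta_i^{(n)} \bigr)^2,$$

$$\operatorname{var} P_{1,2} h(X_1, Y_1, X_2, Y_2) \le \bigl\| \mathrm{E}(Y^2 \mid X) \bigr\|_\infty^2\, k.$$

From $Y = f(X) + \varepsilon$ and $\mathrm{E}(\varepsilon \mid X) = 0$ it follows that $\mathrm{E}(Y^2 \mid X) = f^2(X) +
\mathrm{E}(\varepsilon^2 \mid X) \le \|f\|_\infty^2 + \|\sigma^2\|_\infty$. Combining the preceding bounds we obtain the
theorem. □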



The bound given by the preceding theorem is of the same form as the
bounds given in the preceding sections, but with $\|f\|_\infty^2 + \|\sigma^2\|_\infty$ playing the
role of $\sigma^2$ in Section 3. Again (2.8) is justified with $\hat\tau^2_{k,n,\theta}$ of the order as in
(2.5). Proposition 2.1 gives the same corollaries for confidence regions.

6. Lower bounds. In this section we relate the minimum diameter of a
confidence region to the minimax rates for testing and estimation. Consider
a sequence of statistical experiments $(P_\theta^{(n)} : \theta \in \Theta)$ indexed by a parameter
$\theta \in \Theta$ in a metric space $(\Theta, d)$ and a submodel indexed by a subset $\Theta_1 \subset \Theta$.
We are interested in the maximal diameter over $\Theta_1$ of confidence regions
that are honest over the whole model $\Theta$.

We shall silently understand that appropriate measurability assumptions
regarding the confidence regions are satisfied.

Given $0 < \alpha < \beta < 1$, let $\varepsilon_n$ be a sequence of positive numbers such that
there exists no sequence of tests $\phi_n$ satisfying the two requirements, for some
given subsets $\Theta_{n,1} \subset \Theta_1$,

$$\limsup_{n \to \infty}\ \sup_{\theta \in \Theta:\, d(\theta, \Theta_{n,1}) \ge \varepsilon_n} P_\theta^{(n)} \phi_n \le \alpha, \tag{6.1}$$

$$\limsup_{n \to \infty}\ \sup_{\theta \in \Theta_{n,1}} P_\theta^{(n)} (1 - \phi_n) \le \beta. \tag{6.2}$$

This can only be satisfied if $\alpha + \beta < 1$, because otherwise the trivial test
$\phi_n = \alpha'$ for some $\alpha'$ with $\alpha' \le \alpha$ and $1 - \alpha' \le \beta$ satisfies (6.1)-(6.2). For
$\beta < 1 - \alpha < 1$, the condition is satisfied for $\varepsilon_n$ equal to what Ingster [23]
calls a rate of "not asymptotic indistinguishability of the hypotheses." The
following lemma shows that the diameter over $\Theta_1$ of an honest confidence
set is at least of the order $\varepsilon_n$.



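Lemma 6.1. For given $0 < \alpha < \beta < 1$ and subsets $\Theta_{n,1} \subset \Theta_1$, if there
exists no sequence of tests $\phi_n$ satisfying (6.1)-(6.2), then for any sequence
of confidence sets $C_n$ satisfying (1.1),

$$\limsup_{n \to \infty}\ \sup_{\theta \in \Theta_1} P_\theta^{(n)} \bigl( \operatorname{diam}(C_n) \ge \varepsilon_n \bigr) \ge \beta - \alpha.$$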

Proof. Let $\Theta_{n,0} = \{\theta \in \Theta : d(\theta, \Theta_{n,1}) \ge \varepsilon_n\}$. Given a sequence of confidence
sets $C_n$ satisfying (1.1) define tests by $\phi_n = 1\{d(C_n, \Theta_{n,0}) > 0\}$.

If $\theta \in \Theta_{n,0}$ and $d(C_n, \Theta_{n,0}) > 0$, then $\theta \notin C_n$. Therefore, from (1.1) it is
immediate that these tests satisfy (6.1).

If $\theta \in \Theta_{n,1}$, $d(C_n, \Theta_{n,0}) = 0$ and $\theta \in C_n$, then $\operatorname{diam}(C_n) \ge \varepsilon_n$. [Indeed, for
every $\delta > 0$ there exist points $c \in C_n$ and $\theta_n \in \Theta_{n,0}$ with $d(c, \theta_n) < \delta$. By the
definition of $\Theta_{n,0}$ we have $d(\theta_n, \Theta_{n,1}) \ge \varepsilon_n$ and hence $d(\theta_n, \theta) \ge \varepsilon_n$. By the
triangle inequality $d(c, \theta) \ge \varepsilon_n - \delta$.] It follows that, for every $\theta \in \Theta_{n,1}$,

$$P_\theta^{(n)} (1 - \phi_n) = P_\theta^{(n)} \bigl( d(C_n, \Theta_{n,0}) = 0 \bigr)
\le P_\theta^{(n)} \bigl( \operatorname{diam}(C_n) \ge \varepsilon_n \bigr) + P_\theta^{(n)} (\theta \notin C_n).$$

By (1.1) the second term on the right-hand side is strictly asymptotically
smaller than $\alpha$, uniformly in $\theta \in \Theta$. If the first term on the right-hand side
were asymptotically smaller than $\beta - \alpha$, uniformly in $\theta \in \Theta_1$, thus contradicting
the assertion of the lemma, then the left-hand side would be asymptotically
strictly less than $\beta$, so that the tests would also satisfy (6.2). □

To obtain a lower bound for $\sup_{\theta \in \Theta_1} P_\theta^{(n)} (\operatorname{diam}(C_n) \ge \varepsilon_n)$ we can apply
the preceding lemma with $\Theta_{n,1} = \Theta_1$, but also with every subset of $\Theta_1$. In
particular, we may apply the lemma with a one-point set $\Theta_{n,1} = \{\theta_1\}$, for
any $\theta_1 \in \Theta_1$. For regularity models $\Theta$, Ingster [23] characterizes the minimax
rate for exactly these one-point problems. He shows that there exists a rate
$\varepsilon_n^*$ such that the sum of the error probabilities (6.1)-(6.2) goes to zero if
$\varepsilon_n/\varepsilon_n^* \to \infty$ and goes to 1 if $\varepsilon_n/\varepsilon_n^* \to 0$. Thus the condition of the lemma is
satisfied for any $0 < \alpha < \beta < 1$ with $\alpha + \beta < 1$ and $\varepsilon_n$ with $\varepsilon_n/\varepsilon_n^* \to 0$. The
lemma then says that the weak limit points in $[0, \infty]$ of the distribution of
$\operatorname{diam}(C_n)/\varepsilon_n^*$ have a component of size at least $\beta - \alpha$ concentrated on $(0, \infty]$.
In other words, the order of the diameter is at least $\varepsilon_n^*$.

The relationship between the diameter of confidence regions and the min- 
imax rate for estimation is less perfect, due to the fact that the risk for 
estimation concerns the complete distribution of an estimator, whereas a 
confidence region at level 1 — a leaves a mass of size a completely undis- 
cussed. 

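A key result is as follows. Let $\beta > 0$ be given, and let $\varepsilon_n$ be a sequence of
positive numbers such that for every estimator sequence $T_n$

$$\liminf_{n \to \infty}\ \sup_{\theta \in \Theta_1} P_\theta^{(n)} \bigl( d(T_n, \theta) \ge \varepsilon_n \bigr) \ge \beta. \tag{6.3}$$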

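Lemma 6.2. For given $0 < \alpha < \beta < 1$, if (6.3) holds for every estimator
sequence $T_n$, then for any sequence of confidence sets $C_n$ satisfying (1.1),

$$\liminf_{n \to \infty}\ \sup_{\theta \in \Theta_1} P_\theta^{(n)} \bigl( \operatorname{diam}(C_n) \ge \varepsilon_n \bigr) \ge \beta - \alpha.$$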

Proof. Given a sequence of confidence sets $C_n$, define for each $n$ an
estimator $T_n$ to be an arbitrary point in $C_n$. Then, for any $\theta \in \Theta_1$,

$$P_\theta^{(n)} \bigl( d(T_n, \theta) \ge \varepsilon_n \bigr) \le P_\theta^{(n)} \bigl( \operatorname{diam}(C_n) \ge \varepsilon_n \bigr) + P_\theta^{(n)} (\theta \notin C_n).$$

By (1.1) the second term on the right-hand side is asymptotically smaller
than $\alpha$, uniformly in $\theta \in \Theta$. By assumption the liminf of the supremum of
the left-hand side over $\theta \in \Theta_1$ is bounded below by $\beta$. □

If we choose $\varepsilon_n$ faster than the minimax rate, then typically (6.3) holds
for some $\beta > 0$. In particular this is true if the minimax rate $\varepsilon_n^*$ has the
property that for a "best" estimator sequence $T_n$ the sequence $d(T_n, \theta)/\varepsilon_n^*$
has all its limit points on $(0, \infty]$. In that case $d(T_n, \theta)/\varepsilon_n \to \infty$, and the
right-hand side of (6.3) is 1, for any sequence $\varepsilon_n$ with $\varepsilon_n/\varepsilon_n^* \to 0$. We may
then apply the lemma with any $\beta < 1$. More generally, this argument works
if the weak limit points of the sequence $d(T_n, \theta)/\varepsilon_n^*$ in $[0, \infty]$ possess a point
mass of at most $\beta$ at 0.